First, let’s check if “GenomicDataCommons”, “TCGAbiolinks”, “maftools”, “bio3d” package are installed. And then, load the packages.
library(GenomicDataCommons)
library(TCGAbiolinks)
library(maftools)
library(bio3d)
We are also going to use these websites:
Muscle (optional)
The National Cancer Institute (NCI) in the U.S. established the Genomic Data Commons (GDC) for sharing cancer genomics data-sets.
Here is what its web portal looks like. Let’s try with p53.
Here are the results.
And click Open in Exploration to explore more information.
First check on GDC status:
GenomicDataCommons::status()
## $commit
## [1] "b18b2385b1e916597856067dc6437f3c20b46bca"
##
## $data_release
## [1] "Data Release 19.0 - September 17, 2019"
##
## $status
## [1] "OK"
##
## $tag
## [1] "1.22.0"
##
## $version
## [1] 1
Let’s find the number of cases/patients across different projects within the GDC. This is included in case(), facet() and aggregations() functions act to group all cases by the project id and then count them up.
cases = cases() %>% facet("project.project_id") %>% aggregations()
cases = cases$project.project_id
Let’s plot the number of cases per project, and highlight “TCGA-PAAD”.
col = rep("lightblue", nrow(cases))
col[which(cases$key=="TCGA-PAAD")] = "red"
par(mar=c(9,4,1,2))
barplot(cases$doc_count, names.arg=cases$key, log="y", col=col, las=2)
Let’s find all gene expression data files for all pancreatic cancer patients.
query = GDCquery(project="TCGA-PAAD",
data.category="Transcriptome Profiling",
data.type="Gene Expression Quantification")
ans = getResults(query)
As of May 2019, there are 546 RNA-Seq data files.
nrow(ans)
## [1] 546
Let’s focus on MAF files and store the MAF file contents in a dataframe.
maf = GDCquery_Maf(tumor="PAAD", pipelines="mutect");
variant = read.maf(maf=maf, verbose=FALSE)
Here is the summary of the maf objects.
plotmafSummary(variant)
Here is the oncoplot, which tells mutation information, by specifying top or genes arguments.
oncoplot(maf=variant, top=10)
oncostrip(maf=variant, genes=c("KRAS", "TP53"))
Or if we are interested in one specific gene, then we can call lollipopPlot(). Here are two examples.
lollipopPlot(maf=variant, gene='KRAS')
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 2 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: KRAS NM_004985 NP_004976 188
## 2: KRAS NM_033360 NP_203524 189
## Using longer transcript NM_033360 for now.
lollipopPlot(maf=variant, gene='TP53')
## Assuming protein change information are stored under column HGVSp_Short. Use argument AACol to override if necessary.
## 8 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
## HGNC refseq.ID protein.ID aa.length
## 1: TP53 NM_000546 NP_000537 393
## 2: TP53 NM_001126112 NP_001119584 393
## 3: TP53 NM_001126118 NP_001119590 354
## 4: TP53 NM_001126115 NP_001119587 261
## 5: TP53 NM_001126113 NP_001119585 346
## 6: TP53 NM_001126117 NP_001119589 214
## 7: TP53 NM_001126114 NP_001119586 341
## 8: TP53 NM_001126116 NP_001119588 209
## Using longer transcript NM_000546 for now.
Cancer Vaccine and Immunotherapy is a merging topic in cancer treatment field. The tumor specific mutations that could potentially be used for vaccine development.
In order to achieve that, Comparison of DNA sequences from tumor tissue and normal tissues, but NOT human genome, is important to ensure that the detected differences are somatic mutations, but NOT germline mutations. And, subsequent analysis is to determine whether the mutated region fall into protein coding regions and change the encoded amino acid.
The following sequences resulted from such an NGS analysis of patient healthy and tumor tissue. The FASTA file is saved in the data folder as lecture18_sequences.fa.
>P53_wt Cellular tumor antigen p53 - Healthy Tissue
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
>P53_mutant Cellular tumor antigen p53 - Tumor Tissue
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMLDLMLSPDDIEQWFTEDPGP
DEAPWMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFVHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEV
The lecture18_sequences.fa contains the above sequences.
seqs = read.fasta("data/lecture18_sequences.fa")
# align these sequences if residue position correspondences have not been correctly mapped
#seqs = seqaln(seqs)
Let’s take a look at the aligned sequences. This should be the same as Muscle results as above.
seqs
## 1 . . . . . 60
## P53_wt MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
## P53_mutant MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMLDLMLSPDDIEQWFTEDPGP
## **************************************** *******************
## 1 . . . . . 60
##
## 61 . . . . . 120
## P53_wt DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
## P53_mutant DEAPWMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
## **** *******************************************************
## 61 . . . . . 120
##
## 121 . . . . . 180
## P53_wt SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
## P53_mutant SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
## ************************************************************
## 121 . . . . . 180
##
## 181 . . . . . 240
## P53_wt RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
## P53_mutant RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFVHSVVVPYEPPEVGSDCTTIHYNYMCNS
## ******************************** ***************************
## 181 . . . . . 240
##
## 241 . . . . . 300
## P53_wt SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
## P53_mutant SCMGGMNRRPILTIITLEV-----------------------------------------
## ******************
## 241 . . . . . 300
##
## 301 . . . . . 360
## P53_wt PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
## P53_mutant ------------------------------------------------------------
##
## 301 . . . . . 360
##
## 361 . . . 393
## P53_wt GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
## P53_mutant ---------------------------------
##
## 361 . . . 393
##
## Call:
## read.fasta(file = "data/lecture18_sequences.fa")
##
## Class:
## fasta
##
## Alignment dimensions:
## 2 sequence rows; 393 position columns (259 non-gap, 134 gap)
##
## + attr: id, ali, call
Calculate positional identity scores. This allows us to identify mismatches (< 1).
ide = conserv(seqs$ali, method="identity")
ide
## [1] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [18] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [35] 1.0 1.0 1.0 1.0 1.0 1.0 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [52] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5 1.0 1.0 1.0
## [69] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [86] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [103] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [120] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [137] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [154] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [171] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [188] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [205] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [222] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [239] 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [256] 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [273] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [290] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [307] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [324] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [341] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [358] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [375] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## [392] 0.5 0.5
Let’s find mismatches and gap positions from the analysis.
mismatch = which(ide < 1)
gaps = gap.inspect(seqs)
After that, we are able to find the mutated positions and corresponding amino acid.
mutant.sites = mismatch[mismatch %in% gaps$f.inds]
mutant.names = paste0(seqs$ali["P53_wt",mutant.sites], mutant.sites, seqs$ali["P53_mutant",mutant.sites])
cbind(mutant.sites, mutant.names)
## mutant.sites mutant.names
## [1,] "41" "D41L"
## [2,] "65" "R65W"
## [3,] "213" "R213V"
## [4,] "259" "D259V"
Now, let’s get the sequences with all possible 9-mers in mutant sequence, and save it to subsequences.fa in the data folder.
start.position = mutant.sites - 8
end.position = mutant.sites + 8
store.seqs = matrix("", nrow=length(mutant.sites), ncol=17)
rownames(store.seqs) = mutant.names
for(i in 1:length(mutant.sites)) {
temp = seqs$ali["P53_mutant",start.position[i]:end.position[i]]
store.seqs[i,] = c(temp[temp!="-"],rep("",sum(temp=="-")))
}
write.fasta(seqs=store.seqs, ids=mutant.names, file="data/subsequences.fa")
To prioritize which of the mutations in a tumor should be included in a vaccine, they can be scanned for those resulting in mutated peptides that bind HLA molecules of the patient with high affinity. We will here use algorithms developed by IEDB HLA binding prediction.
subsequences.fa is generated from above and shown below.
>D41L
SPLPSQAMLDLMLSPDD
>R65W
DPGPDEAPWMPEAAPPV
>R213V
YLDDRNTFVHSVVVPYE
>D259V
ILTIITLEV
Here is the input.
Here is the result. And more are stored in result.csv in the data folder.
We can see the
>HLA-A0201_top_pep
YLDDRNTFV
>HLA-A6801_HLA-B3501_top_pep
FVHSVVVPY
>HLA-B0702_top_pep
SPLPSQAML
Here is the input.
Here are the results for HLA-A0201_top_pep.
Here are the results for HLA-A6801_HLA-B3501_top_pep.
Here are the results for HLA-B0702_top_pep.
Conclusion, we should probably choose FVHSVVVPY peptide.